ABSTRACT


In this study, we attempted to find the optimal hyper-parameters of
the convolutional recurrent neural network (CRNN) by investigating its
performance on acoustic event detection. Important hyper-parameters, such
as the input segment length, learning rate, and criterion for the convergence test,
were determined experimentally. Additionally, the effects of batch normalization
and dropout on the performance were measured experimentally to obtain their
optimal combination. Further, we studied the effects of varying the batch
data on every iteration during the training. From the experimental results
using the TUT sound events synthetic 2016 database, we obtained optimal
performance with a learning rate of $10^{-4}$. We found that a longer input
segment length aided performance improvement, and batch normalization
was far more effective than dropout. Finally, performance improvement was
clearly observed by varying the starting points of the batch data for each iteration
during the training.

1. INTRODUCTION

Recently, there has been increasing research interest in acoustic event detection, where the existence
and occurrence times of the various sounds in our daily lives are identified. There are many applications for
acoustic event detection, including surveillance [1, 2], urban sound analysis [3, 4], information retrieval from
multimedia content [5], health care monitoring [6, 7], and bird call detection [8, 9]. Deep neural networks
(DNNs) have demonstrated superior performance to conventional machine learning techniques in image
classification [10-12], speech recognition [13-15], and machine translation [16, 17]. In [18], the feedforward
neural network (FNN) was shown to outperform the Gaussian mixture model (GMM) and support
vector machine (SVM), which have traditionally been employed for acoustic event detection. FNN has
also been shown to outperform the conventional GMM-HMM-based methods in polyphonic acoustic
event detection [19]. Therefore, current studies on acoustic event detection mainly focus on
DNN-based approaches.

However, because the connections between its input and hidden layers are fixed, FNN copes poorly
with signal distortions in image classification. Similarly, FNN is insufficient for acoustic event
detection, as audio signal distortions are frequently encountered in the 2-dimensional time-frequency
domain of the signal. Another problem with FNN lies in modeling the correlation between the time frames
of the audio signal. As FNN only concatenates several input frames together to model
the time correlation, it often fails to model the long-term time correlations of the audio signal.

Convolutional neural networks (CNNs) can alleviate this limitation of FNN by using 2-dimensional
filters, whose parameters are shared along the time and frequency shifts [20], and they have exhibited superior
performance to FNN in various pattern recognition tasks. Particularly, due to its structural characteristics,
CNN can efficiently handle image distortions in the 2-dimensional space [10]. Similarly, we expect that
the time-frequency domain distortions occurring in the audio signal can be accurately modeled by CNN.
Nevertheless, CNN is inefficient for modeling long-term time correlations between audio signal samples.

Recurrent neural networks (RNNs) have been used successfully in speech recognition [13], 
and are superior to other neural networks in modeling the time-correlation information of the time-series 
signals such as speech and audio. However, because RNN is unable to tackle 2-dimensional distortions in 
the time-frequency domain of the audio signal, its performance is usually inferior to CNN when used alone in 
acoustic event detection. Recently, there have been some approaches to combine CNN and RNN for their 
combined merits. Among them, convolutional recurrent neural networks (CRNNs) have been used 
successfully for acoustic event detection [21], speech recognition [22], and music classification [23]. 
As CRNN is constructed by connecting CNN, RNN, and FNN in series, it is more complex than other neural
networks; therefore, the combined effect of the networks is difficult to predict. Moreover, as the use of CRNN
for acoustic event detection is in its early stages, few studies have addressed the optimization of the various
hyper-parameters of CRNN.

Therefore, we attempted to find the optimal hyper-parameters of CRNN for acoustic event 
detection in this study. Several experiments were performed to identify the optimal hyper-parameters of 
CRNN, and we used the test results on the validation data to determine the hyper-parameters. 
Important hyper-parameters, such as the input segment length, learning rate, and criterion for
the convergence test, were determined from the experiments. Additionally, the effects of batch normalization
and dropout on the performance were observed. We also studied the effects of varying the batch data in every 
iteration during the training. This paper is organized as follows. In Section 2, we introduce the feature 
extraction method for audio signals as well as the architecture of the CRNN used for acoustic event detection. 
In Section 3, we present and discuss various experimental results, and conclusions are given in Section 4. 


2. FEATURE EXTRACTION AND CRNN ARCHITECTURE 
2.1. Feature extraction 

In this study, we used log-mel filterbank (LMFB) outputs as input features for CRNN; the entire
process of feature extraction is shown in Figure 1. We first computed the short-time Fourier transform (STFT)
of 40-ms frames of the audio signal, which is sampled at 44.1 kHz. STFTs were computed every 20 ms, so that
successive frames overlap by 50% [21]. Further, 40-dimensional mel filterbank outputs spanning 0 to
22050 Hz were extracted from the STFTs and log-transformed to obtain the LMFBs, which were normalized
by subtracting the mean and dividing by the standard deviation of the entire training data.
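The extraction steps above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the use of the power spectrum, the triangular filter shape, the HTK-style mel formula, and the per-band normalization are all assumptions that the text does not specify.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the paper does not name the variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fmax = sr / 2 if fmax is None else fmax
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(signal, sr=44100, frame_ms=40, hop_ms=20, n_mels=40, eps=1e-10):
    """LMFB features, shape (n_frames, n_mels): 40-ms Hamming frames every 20 ms."""
    frame_len = int(sr * frame_ms / 1000)   # 1764 samples at 44.1 kHz
    hop = int(sr * hop_ms / 1000)           # 882 samples, i.e. 50% overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum (assumed)
    fb = mel_filterbank(n_mels, frame_len, sr)          # spans 0 to sr / 2
    return np.log(power @ fb.T + eps)

def normalize(features, train_mean, train_std):
    # Mean and standard deviation are computed once over the training data.
    return (features - train_mean) / train_std
```

With these settings, one second of audio at 44.1 kHz yields 49 frames of 40 coefficients each; `train_mean` and `train_std` would be computed once on the training features and reused for the validation and testing data.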


44.1 kHz sound signal -> Hamming windowing -> short-time Fourier transform -> mel-frequency filtering -> log transform -> log-mel filterbank (LMFB)

Figure 1. Extraction process of log-mel filterbank (LMFB) 


2.2. The architecture of CRNN 

Figure 2 presents the architecture of the CRNN used in this paper. CRNN consists of CNNs 
followed by RNN and FNN in sequence. The CNNs act as audio feature extractors, which are robust against 
distortions in the time-frequency domain. The RNN utilizes the time-correlation information of the audio
signals. Finally, the FNN serves as an output layer, which produces the posterior probabilities for each sound class at
each time frame.

As the CNN takes 40-dimensional LMFBs as input features, the dimension of the CNN input
is T x 40, where T is the length of the input segment in frames. The CNN consists of 3 convolution layers,
each of which has 256 feature maps with 5x5 filters. The output of each filter is processed by batch normalization
and then passes through the ReLU activation function. To maintain the time-domain dimension, max pooling
is performed only in the frequency domain. Further, dropout is applied at a rate of 0.25 after the max pooling
layer [10]. The input segment length T is set to 1024 frames (20.48 s). We experimented with different values
of T to find the one that produced the best result on the validation dataset.

The output of the CNN is input to the RNN, which consists of 256 gated recurrent units (GRUs).
The output layer consists of K units with the sigmoid activation function, where K represents the number of
acoustic event classes. The sigmoid activation function produces the posterior probabilities for each class at
each time frame, from which we decide whether an acoustic event is active based on a threshold (0.5).
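The thresholding step can be illustrated with a short NumPy sketch. The function names are hypothetical, the strict comparison (`>` rather than `>=`) is an assumption, and the conversion of frame activity into onset/offset times assumes the 20-ms frame hop used for feature extraction.

```python
import numpy as np

def binarize(posteriors, threshold=0.5):
    """posteriors: (frames, classes) sigmoid outputs. Returns a boolean activity matrix."""
    return np.asarray(posteriors) > threshold

def activity_to_events(active, hop_s=0.02):
    """Turn frame-wise activity into a list of (class_index, onset_s, offset_s)."""
    events = []
    for k in range(active.shape[1]):
        # Pad with inactive frames so that every active run has a start and an end.
        col = np.concatenate(([False], active[:, k], [False])).astype(int)
        change = np.diff(col)
        onsets = np.flatnonzero(change == 1)
        offsets = np.flatnonzero(change == -1)
        events += [(k, on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]
    return events
```

For example, a class whose posterior exceeds 0.5 in frames 0 and 1 only is reported as one event from 0.00 s to 0.04 s.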


Input: log-mel filterbank (1024 x 40)

256 feature maps, 5x5, 2-D CNN
batch normalization
activation (ReLU)
5x1 max pooling
dropout (0.25)

256 feature maps, 5x5, 2-D CNN
batch normalization
activation (ReLU)
4x1 max pooling
dropout (0.25)

256 feature maps, 5x5, 2-D CNN
batch normalization
activation (ReLU)
2x1 max pooling
dropout (0.25)

1024 x 256

256 units, GRU
dropout (0.25)

1024 x 256

K (number of classes) units, time-distributed dense layer, activation (sigmoid)

1024 x K

Posterior probabilities for each class (K) at each frame (1024)

Figure 2. Architecture of the CRNN used for acoustic event detection 
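The layer dimensions in Figure 2 can be checked with a small shape trace. This is a sketch under two assumptions that the paper does not state explicitly: the convolutions use "same" padding, so only pooling changes the sizes, and the pooling factors 5, 4, and 2 act along the frequency axis. The function name and its defaults are illustrative; K = 16 corresponds to the 16 classes of the database used in Section 3.

```python
def crnn_shapes(n_frames=1024, n_mels=40, n_maps=256, freq_pools=(5, 4, 2),
                gru_units=256, n_classes=16):
    """Return (layer name, output shape) pairs for the CRNN of Figure 2."""
    shapes = [("input", (n_frames, n_mels))]
    freq = n_mels
    for i, pool in enumerate(freq_pools, start=1):
        freq //= pool  # max pooling shrinks the frequency axis only
        shapes.append((f"conv block {i}", (n_frames, freq, n_maps)))
    # Frequency bins and feature maps are flattened per frame before the GRU.
    shapes.append(("reshape for GRU", (n_frames, freq * n_maps)))
    shapes.append(("GRU", (n_frames, gru_units)))
    shapes.append(("dense + sigmoid", (n_frames, n_classes)))
    return shapes
```

Under these assumptions the frequency axis shrinks from 40 to 8, 2, and 1 bins, so the CNN passes 256 features per frame to the GRU, which agrees with the 1024 x 256 dimension shown in the figure.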


3. EXPERIMENTAL RESULTS 
3.1. Database and evaluation metric 

To evaluate the performance of CRNN on acoustic event detection, we used the TUT sound events
synthetic 2016 (TUT-SED Synthetic) database, which is widely used in this area [24]. TUT-SED Synthetic
consists of artificially generated audio data, because it is difficult to obtain enough data from recordings
made in real environments alone. Moreover, artificial data mitigate subjective labelling errors. TUT-SED
Synthetic was generated by mixing isolated sound events from 16 different classes. The total length of the data is 566
minutes, which was divided into training, testing, and validation data with proportions of 60%, 20%,
and 20%, respectively. Segments of length 3-15 seconds were used for the training, testing, and validation,
and there were no common acoustic event instances between them. The detailed sound classes and their total
duration in the database are shown in Table 1.

For the evaluation metric of the acoustic event detection, we used both error rate (ER) and F-score [25].
We adopted two types of evaluation methods: segment-based and event-based. In the segment-based method,
the binarized outputs of the CRNN are compared with the ground truth table in every segment of length 1 s.
In the event-based method, the output of the CRNN is compared with the label in the ground truth table
whenever an acoustic event has been detected by the CRNN [25]. We sought the optimal hyper-parameters of
the CRNN by applying various conditions during the training. To find the optimal learning rate,
we trained the CRNN with several learning rates. The results are shown in Table 2.
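The segment-based F-score and ER described above can be sketched as follows. The sketch uses the standard definitions for polyphonic sound event detection, which we assume are those of [25]: F = 2TP / (2TP + FP + FN) and ER = (S + D + I) / N, where substitutions, deletions, and insertions are counted per segment and N is the number of active reference events. The function names and the choice of 50 frames per 1-s segment (20-ms hop) are assumptions for illustration.

```python
import numpy as np

def to_segments(frames, frames_per_segment):
    """A class is active in a segment if it is active in any frame of that segment."""
    frames = np.asarray(frames).astype(bool)
    n_seg = int(np.ceil(frames.shape[0] / frames_per_segment))
    pad = n_seg * frames_per_segment - frames.shape[0]
    padded = np.pad(frames, ((0, pad), (0, 0)))  # pads with inactive frames
    return padded.reshape(n_seg, frames_per_segment, -1).any(axis=1)

def segment_based_scores(ref_frames, est_frames, frames_per_segment=50):
    ref = to_segments(ref_frames, frames_per_segment)
    est = to_segments(est_frames, frames_per_segment)
    tp = (ref & est).sum(axis=1)    # per-segment true positives
    fp = (~ref & est).sum(axis=1)   # per-segment false positives
    fn = (ref & ~est).sum(axis=1)   # per-segment false negatives
    subs = np.minimum(fp, fn).sum()         # substitutions
    dels = np.maximum(0, fn - fp).sum()     # deletions
    ins = np.maximum(0, fp - fn).sum()      # insertions
    f_score = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    error_rate = (subs + dels + ins) / max(ref.sum(), 1)
    return float(f_score), float(error_rate)
```

Note that ER can exceed 1 when the system produces many insertions, which is why several event-based ER values in the tables are larger than 1.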

We applied batch normalization and dropout in all cases, and binary cross entropy was employed as
the loss function, which acts as the criterion for the convergence of the weights. The Adam optimizer was
used to optimize the neural networks. As seen in Table 2, the optimal CRNN performance was achieved at
a learning rate of $10^{-4}$. Generally, as the optimal learning rate for neural networks is not pre-determined
and varies with both the network architecture and the amount of training data, we searched for the optimal learning
rate by observing the performance on the validation data set. We can see in Table 2 that the best learning rate
for the testing data is the same as the learning rate that achieves the best performance on the validation data.
This indicates that it is reasonable to determine the optimal learning rate of CRNN by its performance on
the validation data. Table 2 also shows that as the learning rate decreases, the number of epochs at which
the best performance is reached increases. For example, the number of epochs is 33 when the learning rate is $10^{-4}$,
whereas it increases to 191 with a learning rate of $10^{-6}$. The larger number of epochs indicates a slower
convergence of the CRNN parameters, which causes performance degradation due to the underfitting of the neural
networks. In contrast, when the learning rate increases to $10^{-3}$, the number of epochs decreases dramatically to 16,
which causes overfitting and results in performance degradation.
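The selection rule described above (choose the learning rate by validation performance, then check it against the testing data) can be illustrated with the segment-based F-scores from Table 2. The dictionary layout and variable names are only for illustration.

```python
# Segment-based F-scores (%) and best epochs taken from Table 2.
table2 = {
    1e-3: {"val_f": 61.69, "test_f": 60.61, "epoch": 16},
    1e-4: {"val_f": 68.75, "test_f": 64.21, "epoch": 33},
    1e-5: {"val_f": 66.44, "test_f": 63.76, "epoch": 157},
    1e-6: {"val_f": 44.16, "test_f": 43.38, "epoch": 191},
}

# Learning rate chosen using the validation data only.
best_on_validation = max(table2, key=lambda lr: table2[lr]["val_f"])

# Learning rate that would have been best on the testing data.
best_on_test = max(table2, key=lambda lr: table2[lr]["test_f"])

# Learning rates sorted from largest to smallest, with the epoch of best performance.
epochs_by_lr = [(lr, table2[lr]["epoch"]) for lr in sorted(table2, reverse=True)]
```

Both selections give the same learning rate, and `epochs_by_lr` shows the number of epochs growing as the learning rate shrinks.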

To investigate the performance variation with the learning rate further, Figure 3 shows
the variation of the loss function and accuracy at the output of the CRNN during the training as the learning
rate varied from $10^{-4}$ to $10^{-7}$. When the learning rate is $10^{-4}$, it can be seen that the loss function on
the validation data reaches its minimum at approximately 30 epochs (33 precisely) and subsequently
fluctuates (but never falls below the minimum). However, for the training data, the loss function decreases
from the beginning to the end of the training (we set the maximum number of epochs to 200). As it
is important for the networks not to be overfitted, we stopped the iteration at 33 epochs using the early stopping
algorithm. Meanwhile, we see different characteristics when the learning rate is $10^{-5}$.
The loss function on the validation data decreases for longer and reaches its minimum at 157 epochs. The longer
training lowers the performance of the CRNN on both the validation and test data due to
the underfitting problem. This phenomenon becomes more pronounced as we further decrease the learning rate.
When the learning rate is $10^{-7}$, the loss function does not reach its minimum until the end of the training. A similar
trend is observed when we monitor the accuracy instead of the loss function.

We investigated the effect on performance as we changed the input batch data in every epoch of
the training. The starting points of each batch data were shifted at each epoch, making the input segments at
successive epochs differ by the shift-length. For the evaluation, both segment-based and event-based methods
were used and the results are shown in F-score and ER. We used ER as the convergence criterion for
the training. Further, early stopping was employed, where we stopped the training when the convergence
criterion did not improve for more than 100 epochs on the validation data. This was to prevent overfitting,
and the maximum number of epochs was set to 200. We also investigated the effects of batch normalization (BN)
and dropout. BN has been used to mitigate the vanishing and exploding gradient problems in the backpropagation
algorithm, and we used BN before the ReLU activation function in the convolution layer. Dropout is a popular
regularization method for neural networks, which excludes neurons from training with
a predefined probability (dropout rate). In this study, dropout is used in all layers of the CNN and RNN (but not
the FNN) [18].
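The early stopping rule described above can be sketched as follows. The function name is hypothetical, and for simplicity the sketch takes a precomputed list of validation ER values (one per epoch) instead of running an actual training loop.

```python
def early_stopping(validation_er, patience=100, max_epochs=200):
    """Return (best_epoch, best_er, last_epoch) for a per-epoch validation ER curve.

    Training stops once the ER has not improved for more than `patience`
    epochs, or when `max_epochs` is reached. Lower ER is better.
    """
    best_er, best_epoch, last_epoch = float("inf"), 0, 0
    for epoch, er in enumerate(validation_er[:max_epochs], start=1):
        last_epoch = epoch
        if er < best_er:
            best_er, best_epoch = er, epoch   # new best model at this epoch
        elif epoch - best_epoch > patience:
            break                             # no improvement for too long
    return best_epoch, best_er, last_epoch
```

With a curve whose minimum falls at epoch 33 and never improves afterwards, the loop runs until epoch 134 and reports epoch 33 as the best one; a curve that keeps improving runs for the full 200 epochs.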

Table 3 presents the results of using the shift of batch data. In the segment-based evaluation,
we see an improved average F-score/ER of 63.18%/0.52 using the shift compared to the F-score/ER
of 62.11%/0.54 without the shift (non-shift). We also observe a slight performance improvement in
the event-based evaluation. These results indicate that shifting the batch data during training
improves the performance of the CRNN.
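One way to realize the batch data shift is sketched below. It is only an illustration of the idea: the function name is hypothetical, and the shift length and the wrap-around of the offset are assumptions, since the paper does not give these details.

```python
import numpy as np

def epoch_segments(features, segment_len, epoch, shift):
    """Cut (frames, dims) features into segments whose start moves with the epoch.

    Returns an array of shape (n_segments, segment_len, dims).
    """
    offset = (epoch * shift) % segment_len      # starting point for this epoch
    usable = features[offset:]
    n_seg = usable.shape[0] // segment_len      # drop the incomplete tail
    trimmed = usable[: n_seg * segment_len]
    return trimmed.reshape(n_seg, segment_len, features.shape[1])
```

With a non-zero shift, successive epochs see segments that start at different frames of the same recordings, so the network is trained on more distinct input segments than with fixed segment boundaries.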

Table 3 also shows the effects of BN and dropout on performance. As expected, the best average F-score/ER
is obtained when applying both BN and dropout (56.67%/0.71). If we apply only BN without dropout, we see
a slight performance degradation (54.54%/0.79), which implies that the effect of dropout on performance
is not significant. Meanwhile, if we do not apply BN, the performance degrades significantly (regardless of
whether we apply dropout), and the poorest result is obtained when we apply neither BN nor dropout
(47.15%/0.81).

In Table 4, we compare the performance of the CRNN between two convergence criteria (ER and
F-score) for training. BN and dropout were applied and the overlap method was used. In the table, we observe
a slight performance improvement with ER, but the performance difference appears negligible, and we
conclude that either ER or F-score can be used as the convergence criterion without significantly affecting
the performance.
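The relative effect of BN and dropout reported in Table 3 can be made explicit with a short calculation on the average F-scores of that table. The helper below is only an illustrative way of summarizing those four numbers; its name and the averaging over the other factor are our own choices.

```python
# Average F-score (%) from Table 3, keyed by (BN applied, dropout applied).
avg_f = {
    (True, False): 54.54,
    (True, True): 56.67,
    (False, False): 47.15,
    (False, True): 49.4,
}

def marginal_gain(factor_index):
    """Mean F-score with the factor on minus mean F-score with it off."""
    on = [f for key, f in avg_f.items() if key[factor_index]]
    off = [f for key, f in avg_f.items() if not key[factor_index]]
    return sum(on) / len(on) - sum(off) / len(off)

bn_gain = marginal_gain(0)        # about 7.3 percentage points
dropout_gain = marginal_gain(1)   # about 2.2 percentage points
```

Averaged over the other factor, BN raises the F-score by roughly 7.3 points, whereas dropout raises it by roughly 2.2 points.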


Table 1. Sound classes and their total duration in seconds in the TUT-SED Synthetic 2016 database

Classes    Duration (s)    Classes        Duration (s)
Glass      621             Motor cycle    3691
Gun        534             Foot steps     1173
Cat        941             Claps          3278
Dog        716             Bus            3464
Thunder    3007            Mixer          4020
Bird       2298            Crowd          4825
Horse      1614            Alarm          4405
Crying     2007            Rain           3975

Table 2. Performance of CRNN as learning rate changes

Learning rate    Validation data (F-score/ER)    Testing data (F-score/ER)     Epoch
                 Segment        Event            Segment        Event
$10^{-3}$        61.69%/0.52    37.69%/0.96      60.61%/0.53    37.05%/0.97    16
$10^{-4}$        68.75%/0.45    43.49%/0.88      64.21%/0.50    40.50%/0.96    33
$10^{-5}$        66.44%/0.49    39.10%/0.96      63.76%/0.52    36.48%/1.04    157
$10^{-6}$        44.16%/0.69    9.83%/1.24       43.38%/0.71    10.82%/1.27    191

[Figure 3: four panels, one per learning rate ($10^{-4}$, $10^{-5}$, $10^{-6}$, $10^{-7}$), each plotting the loss function and accuracy on the training and validation data against the epoch]
Figure 3. Variation of loss function and accuracy as learning rate changes 

Table 3. Performance of CRNN with various training conditions

BN     Dropout    Segment-based (F-score/ER)     Event-based (F-score/ER)      Average
                  Shift          Non-shift       Shift          Non-shift
Yes    No         66.10%/0.48    65.28%/0.54     43.80%/1.01    42.99%/1.12    54.54%/0.79
Yes    Yes        67.24%/0.49    66.62%/0.49     45.93%/0.94    46.88%/0.92    56.67%/0.71
No     No         58.58%/0.57    58.88%/0.58     35.47%/1.06    35.68%/1.04    47.15%/0.81
No     Yes        60.80%/0.54    57.67%/0.56     40.02%/0.97    39.18%/1.00    49.4%/0.77
Average           63.18%/0.52    62.11%/0.54     41.3%/1.0      41.18%/1.02


Table 4. Performance of CRNN depending on the convergence criterion 


Convergence criterion    Segment-based (F-score/ER)    Event-based (F-score/ER)    Average
ER                       67.24%/0.49                   45.93%/0.94                 56.59%/0.72
F-score                  66.45%/0.51                   44.97%/0.98                 55.71%/0.75


Finally, in Table 5, we show the performance variation as we changed the length of the input segment of
CRNN. Compared to shorter lengths (2.56 s, 5.12 s), longer segments (10.24 s, 20.48 s) show better
performance. This may be because many of the acoustic events in TUT-SED Synthetic have long
durations and the RNN can efficiently capture the time correlations in the long segments.


Table 5. Performance of CRNN depending on the length of the input segment 


Segment length (s)    Segment-based (F-score/ER)    Event-based (F-score/ER)    Average
2.56                  66.84%/0.51                   42.75%/1.13                 54.80%/0.82
5.12                  67.47%/0.50                   44.27%/1.04                 55.87%/0.77
10.24                 68.01%/0.47                   45.33%/0.97                 56.67%/0.72
20.48                 67.24%/0.49                   45.93%/0.94                 56.54%/0.70

4. CONCLUSION

For acoustic event detection, approaches based on deep neural networks have demonstrated
superior performance to conventional machine learning methods, such as GMM and SVM. Among them,
CRNN is thought to be well suited for acoustic event detection due to its robustness against signal distortions in
the time-frequency domain as well as its ability to exploit the temporal-correlation information of the audio signal.
In this study, we employed CRNN as the classifier for acoustic event detection, and several of
its conditions were tested by extensive experiments to determine its optimal hyper-parameters.

In the experiments, by varying the learning rate, we found that the optimum performance is obtained
when the learning rate is set to $10^{-4}$. From the results, we could also see that the learning rate that exhibits
optimum performance on the validation data also performs best on the testing data. This suggests that it
is reasonable to determine the optimal learning rate based on performance tests on the validation data.
We further confirmed that BN and dropout contributed to improving the performance of CRNN.
Particularly, BN had a larger impact on the performance improvement than dropout.

Instead of using identical batch data at every iteration, we changed the batch data in every
iteration and obtained improved performance, which we attribute to the resulting increase in the number of
training samples for CRNN. We further found that the length of the input segments of the CRNN also affects
the performance. We obtained better performance using longer segments, as the acoustic events used in this
paper had relatively long durations.